9 research outputs found

    Efficient Modeling of Future Context for Image Captioning

    Full text link
    Existing approaches to image captioning usually generate the sentence word by word from left to right, conditioned only on local context, namely the given image and the history of generated words. Many studies have targeted making use of global information during decoding, e.g., through iterative refinement, but how to incorporate future context effectively and efficiently remains under-explored. To address this issue, inspired by the fact that Non-Autoregressive Image Captioning (NAIC) can leverage two-sided relations through a modified mask operation, we aim to graft this advance onto the conventional Autoregressive Image Captioning (AIC) model while maintaining inference efficiency without extra time cost. Specifically, the AIC and NAIC models are first trained jointly with a shared visual encoder, forcing the visual encoder to contain sufficient and valid future context; then the AIC model is encouraged to capture the causal dynamics of cross-layer interchanging from the NAIC model on its unconfident words, following a teacher-student paradigm optimized with a distribution calibration training objective. Empirical evidence demonstrates that our proposed approach clearly surpasses the state-of-the-art baselines in both automatic metrics and human evaluations on the MS COCO benchmark. The source code is available at: https://github.com/feizc/Future-Caption. Comment: ACM Multimedia 202
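
    The two-stage recipe above lends itself to a short illustration. Below is a minimal PyTorch sketch, not the released Future-Caption code: `encoder`, `aic_decoder`, `naic_decoder`, the threshold `tau` and the weight `alpha` are hypothetical stand-ins for the shared visual encoder, the autoregressive student, the non-autoregressive teacher and the calibration objective applied only on low-confidence positions.

```python
import torch.nn.functional as F

def joint_step(encoder, aic_decoder, naic_decoder, images, captions, tau=0.5, alpha=1.0):
    """One hypothetical training step: a shared visual encoder feeds both an
    autoregressive (AIC) and a non-autoregressive (NAIC) decoder, and the AIC
    model is calibrated toward the NAIC teacher only on its unconfident words."""
    visual = encoder(images)                           # shared visual features
    aic_logits = aic_decoder(visual, captions)         # (B, T, V), left-to-right
    naic_logits = naic_decoder(visual, captions)       # (B, T, V), masked / two-sided

    ce_aic = F.cross_entropy(aic_logits.transpose(1, 2), captions)
    ce_naic = F.cross_entropy(naic_logits.transpose(1, 2), captions)

    # Distribution calibration only where the student is unsure.
    probs = aic_logits.softmax(-1)
    confidence = probs.max(-1).values                  # per-token confidence
    unsure = (confidence < tau).float()
    kl = F.kl_div(probs.clamp_min(1e-8).log(),
                  naic_logits.softmax(-1).detach(),
                  reduction="none").sum(-1)
    calibration = (kl * unsure).sum() / unsure.sum().clamp_min(1.0)

    return ce_aic + ce_naic + alpha * calibration
```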

    Progressive Denoising Model for Fine-Grained Text-to-Image Generation

    Full text link
    Recently, vector quantized autoregressive (VQ-AR) models have shown remarkable results in text-to-image synthesis by predicting discrete image tokens one by one, from the top left to the bottom right of the latent space, with every token treated equally. Although this simple generative process works surprisingly well, is it the best way to generate an image? For instance, human creation tends to proceed from the outline of an image to its fine details, whereas VQ-AR models do not consider the relative importance of each component. In this paper, we present a progressive denoising model for high-fidelity text-to-image generation. The proposed method creates new image tokens from coarse to fine based on the existing context in a parallel manner, and this procedure is applied recursively until the image sequence is complete. The resulting coarse-to-fine hierarchy makes the image generation process intuitive and interpretable. Extensive experiments demonstrate that the progressive model produces significantly better results than the previous VQ-AR method in FID score across a wide variety of categories and aspects. Moreover, the text-to-image generation time of traditional AR models increases linearly with the output image resolution and is therefore quite time-consuming even for normal-size images; in contrast, our approach achieves a better trade-off between generation quality and speed. Comment: Technical report. arXiv admin note: text overlap with arXiv:2206.10789 by other authors
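
    A rough sketch of the coarse-to-fine idea under assumed interfaces (the `model(tokens, text_emb)` signature, `mask_id` and the per-round schedule are illustrative, not the paper's API): each round predicts all still-masked latent tokens in parallel and fixes only the most confident ones, recursing until the token grid is complete.

```python
import torch

@torch.no_grad()
def progressive_decode(model, text_emb, seq_len, rounds=8, mask_id=0):
    """Hypothetical coarse-to-fine decoding over discrete image tokens:
    fix the most confident predictions first, refine the rest in later rounds."""
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    known = torch.zeros(1, seq_len, dtype=torch.bool)
    for r in range(rounds):
        if int(known.sum()) == seq_len:
            break
        logits = model(tokens, text_emb)               # assumed: (1, seq_len, vocab)
        probs, preds = logits.softmax(-1).max(-1)
        probs = probs.masked_fill(known, -1.0)         # never re-rank fixed tokens
        target = (r + 1) * seq_len // rounds           # tokens that should be fixed by now
        keep = max(1, target - int(known.sum()))
        idx = probs.topk(keep, dim=-1).indices         # most confident new positions
        tokens.scatter_(1, idx, preds.gather(1, idx))
        known.scatter_(1, idx, True)
    return tokens
```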

    Selecting Stickers in Open-Domain Dialogue through Multitask Learning

    Full text link
    With the increasing popularity of online chatting, stickers are becoming important in online communication. Selecting appropriate stickers in open-domain dialogue requires a comprehensive understanding of both the dialogue and the stickers, as well as the relationship between the two modalities. To tackle these challenges, we propose a multitask learning method comprising three auxiliary tasks that enhance the understanding of the dialogue history and of the emotion and semantic meaning of stickers. Extensive experiments conducted on a recent challenging dataset show that our model combines the multimodal information more effectively and achieves significantly higher accuracy than strong baselines. An ablation study further verifies the effectiveness of each auxiliary task. Our code is available at \url{https://github.com/nonstopfor/Sticker-Selection}. Comment: ACL 2022 findings, camera-ready
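
    A schematic of such a multitask objective, with hypothetical heads and batch keys standing in for the paper's dialogue-history, emotion and sticker-semantics tasks:

```python
import torch.nn.functional as F

def multitask_loss(model, batch, weights=(1.0, 0.3, 0.3, 0.3)):
    """Hypothetical combination of the main sticker-selection loss with three
    auxiliary objectives over dialogue history, emotion and sticker semantics."""
    dialogue, stickers = model.encode(batch)                         # shared encoders
    main = F.cross_entropy(model.match(dialogue, stickers), batch["label"])
    history = F.cross_entropy(model.history_head(dialogue), batch["history_label"])
    emotion = F.cross_entropy(model.emotion_head(stickers), batch["emotion_label"])
    semantic = F.cross_entropy(model.semantic_head(stickers), batch["semantic_label"])
    w_main, w_h, w_e, w_s = weights
    return w_main * main + w_h * history + w_e * emotion + w_s * semantic
```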

    Prefix-diffusion: A Lightweight Diffusion Model for Diverse Image Captioning

    Full text link
    While impressive performance has been achieved in image captioning, the limited diversity of the generated captions and the large parameter scale remain major barriers to the real-world application of these systems. In this work, we propose a lightweight image captioning network combined with continuous diffusion, called Prefix-diffusion. To achieve diversity, we design an efficient method that injects prefix image embeddings into the denoising process of the diffusion model. To reduce the number of trainable parameters, we employ a pre-trained model to extract image features and further design an extra mapping network. Prefix-diffusion is able to generate diverse captions with relatively few parameters while maintaining fluency and relevance, benefiting from the generative capabilities of the diffusion model. Our work paves the way for scaling up diffusion models for image captioning and achieves promising performance compared with recent approaches. Comment: 11 pages, 4 figures, 6 tables
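
    A minimal sketch of the prefix idea under assumed module names (the `image_encoder.out_dim` attribute, the transformer signature and the prefix length are illustrative, not the paper's implementation): a frozen pre-trained encoder yields image features, a small mapping network turns them into prefix embeddings, and those are prepended to the noised caption embeddings at every denoising step.

```python
import torch
import torch.nn as nn

class PrefixDenoiser(nn.Module):
    """Hypothetical denoiser: a frozen image encoder plus a small trainable
    mapping network that prepends image-derived prefix tokens to the noised caption."""
    def __init__(self, image_encoder, transformer, dim, prefix_len=10):
        super().__init__()
        self.image_encoder = image_encoder.eval()      # pre-trained, kept frozen
        for p in self.image_encoder.parameters():
            p.requires_grad_(False)
        self.mapper = nn.Sequential(                   # lightweight mapping network
            nn.Linear(image_encoder.out_dim, dim * prefix_len), nn.Tanh())
        self.transformer = transformer
        self.prefix_len, self.dim = prefix_len, dim

    def forward(self, noisy_caption_emb, timestep, images):
        feat = self.image_encoder(images)                         # (B, out_dim)
        prefix = self.mapper(feat).view(-1, self.prefix_len, self.dim)
        x = torch.cat([prefix, noisy_caption_emb], dim=1)         # prefix injection
        out = self.transformer(x, timestep)                       # assumed signature
        return out[:, self.prefix_len:]                           # keep only caption positions
```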

    Partially Non-Autoregressive Image Captioning

    No full text
    Current state-of-the-art image captioning systems usually generate descriptions autoregressively, i.e., every forward step is conditioned on the given image and the previously produced words. This sequential nature causes unavoidable decoding latency. Non-autoregressive image captioning, on the other hand, predicts the entire sentence simultaneously and accelerates the inference process significantly, but it removes the dependencies within a caption and commonly suffers from repetition or omission issues. To make a better trade-off between speed and quality, we introduce a partially non-autoregressive model, named PNAIC, which treats a caption as a series of concatenated word groups. The groups are generated in parallel globally, while each word within a group is predicted from left to right, so the captioner can create multiple discontinuous words concurrently at each time step. More importantly, by incorporating curriculum learning-based training tasks of group length prediction and invalid group deletion, our model is capable of generating accurate captions while preventing common incoherence errors. Extensive experiments on the MS COCO benchmark demonstrate that our proposed method achieves more than 3.5× speedup while maintaining competitive performance.
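
    An illustrative decoding loop for the group-wise scheme, with an assumed `model(visual, prefixes)` interface and fixed group count and length: the groups advance in parallel, while each group extends itself left to right one word per step.

```python
import torch

@torch.no_grad()
def pnaic_decode(model, visual, num_groups=5, group_len=4, bos=1):
    """Hypothetical partially non-autoregressive decoding: at every step each
    group predicts its next word simultaneously, while words inside a group
    are still produced left to right."""
    groups = [[bos] for _ in range(num_groups)]
    for _ in range(group_len):
        prefixes = torch.tensor(groups)                # (num_groups, current_len)
        logits = model(visual, prefixes)               # assumed: (num_groups, vocab)
        for g, w in zip(groups, logits.argmax(-1).tolist()):
            g.append(w)
    # Concatenate the groups (dropping bos markers) to form the final caption.
    return [w for g in groups for w in g[1:]]
```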

    Memory-Augmented Image Captioning

    No full text
    Current deep learning-based image captioning systems have been shown to store practical knowledge in their parameters and achieve competitive performance on public datasets. Nevertheless, their ability to access and precisely manipulate this mastered knowledge is still limited. Besides, providing evidence for decisions and updating memorized information are also important yet under-explored. Towards this goal, we introduce a memory-augmented method, which extends an existing image captioning model by incorporating extra explicit knowledge from a memory bank. Relevant knowledge is recalled according to the similarity distance in the embedding space of the history context, and the memory bank can be constructed conveniently from any matched image-text set, e.g., the previous training data. Incorporating this non-parametric memory-augmented method into various captioning baselines consistently improves the performance of the resulting captioners on the evaluation benchmark. More encouragingly, extensive experiments demonstrate that our approach can efficiently adapt to larger training datasets by simply swapping in a new memory bank, without any additional training.
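
    A toy sketch of the non-parametric memory, with assumed tensor shapes: keys are history-context embeddings, values are the token ids observed after them in any matched image-text set, and recall is a plain cosine-similarity nearest-neighbour lookup.

```python
import torch
import torch.nn.functional as F

class MemoryBank:
    """Hypothetical memory bank: `keys` are history-context embeddings (N, d),
    `values` are the token ids that followed them in the reference set (N,)."""
    def __init__(self, keys, values):
        self.keys = F.normalize(keys, dim=-1)
        self.values = values

    def recall(self, query, k=8):
        q = F.normalize(query, dim=-1)                 # (B, d) query contexts
        sims = q @ self.keys.T                         # cosine similarities
        top = sims.topk(k, dim=-1)
        return self.values[top.indices], top.values    # neighbour tokens and their weights

# Adapting to a larger dataset only means rebuilding `keys` and `values`;
# the captioner's own parameters are left untouched.
```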

    Retrieve and Revise: Improving Peptide Identification with Similar Mass Spectra

    No full text
    Tandem mass spectrometry is an indispensable technology for the identification of proteins from complex mixtures. Accurate and sensitive analysis of large amounts of mass spectra is a principal challenge in proteomics. Conventional deep learning-based peptide identification models usually adopt an encoder-decoder framework and generate the target sequence from left to right without fully exploiting global information. A few recent approaches employ two-pass decoding, yet they have limitations when facing spectra filled with noise. In this paper, we propose a new paradigm for improved peptide identification, which first retrieves a similar mass spectrum from a database as a reference and then revises the matched sequence according to the difference between the referenced spectrum and the current context. The design is inspired by the observation that the retrieved peptide-spectrum pair provides a good starting point and indirect access to both past and future information, so each revised amino acid can be produced with better noise perception and global understanding. Moreover, a disturb-based optimization process with reinforcement learning is introduced to sharpen the attention over the difference vector before it is fed to the decoder. Experimental results on several public datasets demonstrate that the proposed method obtains a prominent performance boost. Remarkably, we achieve new state-of-the-art identification results on these datasets.
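
    A simplified sketch of the retrieve-then-revise flow (the binned-spectrum representation and the `reviser` interface are assumptions for illustration, not the paper's implementation):

```python
import numpy as np

def retrieve_and_revise(query_spectrum, spectrum_db, peptide_db, reviser):
    """Hypothetical retrieve-then-revise flow: pick the most similar reference
    spectrum, then let a revision model edit its peptide using the difference
    between the query and the reference."""
    # Cosine similarity between binned spectra: query (d,), database (N, d).
    sims = spectrum_db @ query_spectrum / (
        np.linalg.norm(spectrum_db, axis=1) * np.linalg.norm(query_spectrum) + 1e-8)
    best = int(np.argmax(sims))
    reference_peptide = peptide_db[best]
    difference = query_spectrum - spectrum_db[best]    # what the reference does not explain
    # The reviser sees the reference sequence and the spectral difference, so each
    # amino acid can be corrected with indirect access to past and future context.
    return reviser(reference_peptide, difference)
```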

    Uncertainty-Aware Image Captioning

    No full text
    It is well believed that the higher the uncertainty of a word in a caption, the more inter-correlated context information is required to determine it. However, current image captioning methods usually generate all words in a sentence sequentially and treat them equally. In this paper, we propose an uncertainty-aware image captioning framework, which iteratively inserts discontinuous candidate words between existing words in parallel, from easy to difficult, until convergence. We hypothesize that high-uncertainty words in a sentence need more prior information to be decided correctly and should therefore be produced at a later stage. The resulting non-autoregressive hierarchy makes caption generation explainable and intuitive. Specifically, we utilize an image-conditioned bag-of-words model to measure word uncertainty and apply a dynamic programming algorithm to construct the training pairs. During inference, we devise an uncertainty-adaptive parallel beam search technique that yields an empirically logarithmic time complexity. Extensive experiments on the MS COCO benchmark reveal that our approach outperforms the strong baseline and related methods in both captioning quality and decoding speed.
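
    A greatly simplified, greedy sketch of easy-to-difficult insertion decoding (the paper uses an uncertainty-adaptive parallel beam search; the `model(visual, caption)` per-gap interface and the use of a `no_insert` token are assumptions for illustration):

```python
import torch

@torch.no_grad()
def insertion_decode(model, visual, max_rounds=6, no_insert=0):
    """Hypothetical easy-to-difficult decoding: start from an empty caption and,
    in each round, insert candidate words in parallel into the gaps between
    existing words, stopping once no gap proposes a new word."""
    caption = []
    for _ in range(max_rounds):
        slots = len(caption) + 1                       # gaps between / around words
        logits = model(visual, torch.tensor([caption], dtype=torch.long))
        words = logits.argmax(-1)[0].tolist()          # assumed: one candidate per gap
        new_caption, inserted = [], False
        for i in range(slots):
            if words[i] != no_insert:                  # no_insert marks "leave this gap empty"
                new_caption.append(words[i])
                inserted = True
            if i < len(caption):
                new_caption.append(caption[i])
        if not inserted:
            break                                      # converged
        caption = new_caption
    return caption
```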

    Rac GTPase activating protein 1 promotes the glioma growth by regulating the expression of MCM3

    No full text
    Glioma is the most common tumor of the nervous system. The diffuse growth and proliferation of glioma pose great challenges for its treatment. Here, transcriptomic analysis revealed that Rac GTPase activating protein 1 (RACGAP1) is highly expressed in glioma. RACGAP1 has been shown to play an important role in the malignant biological progression of a variety of tumors; however, its role and mechanism in glioma remain poorly understood. Using quantitative real-time polymerase chain reaction (qRT-PCR), western blot, immunohistochemistry and orthotopic mouse xenografts, we confirmed that knockdown of RACGAP1 impeded glioma cell proliferation and prolonged the survival of orthotopic mice. Interestingly, through RNA-seq and rescue assays we also found that inhibiting the expression of RACGAP1 reduced the expression of minichromosome maintenance 3 (MCM3), while Yin Yang 1 (YY1) transcriptionally regulated RACGAP1 expression. Furthermore, T7 peptide-decorated exosomes (T7-exo) are regarded as a promising delivery modality for targeted glioma therapy, and T7-siRACGAP1-exo significantly improved the survival time of glioma-bearing mice. These results suggest that targeting RACGAP1 may be a potential strategy for glioma therapy.